Ch. 17 - NLP and Word Embeddings

Welcome to week 4! This week, we will take a look at natural language processing. From Wikipedia:

Natural language processing (NLP) is a field of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language data.

Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

While last week was about making computers able to see, this week is about making them able to read. This is useful in the financial industry, where large amounts of information are usually presented in the form of text. From ticker headlines to news reports, analyst reports, and off-the-record chit-chat by industry figures on social media, text is in many ways at the very center of what the financial industry does. This week, we will take a look at text classification problems and sentiment analysis.

Sentiment analysis with the IMDB dataset

Sentiment analysis is about judging how positive or negative the tone in a document is. The output of a sentiment analysis is a score between zero and one, where one means the tone is very positive and zero means it is very negative. Sentiment analysis is used for trading quite frequently. For example, the sentiment of quarterly reports issued by firms is automatically analyzed to see how the firm judges its own position. Sentiment analysis is also applied to the tweets of traders to estimate an overall market mood. Today, there are many data providers that offer sentiment analysis as a service.

In principle, training a sentiment analysis model works just like training any other binary text classifier: the text gets classified as positive (1) or not positive (0). This works exactly like other binary classification tasks, except that we need some new tools to handle text.

A common dataset for sentiment analysis is the corpus of Internet Movie Database (IMDB) movie reviews. Since each review comes with a text and a numerical rating, the number of stars, it is easy to label the training data. In the IMDB dataset, reviews that gave fewer than five stars were labeled negative, while reviews that gave seven or more stars were labeled positive (IMDB works with a ten-star scale). Let's give the data a look:


In [2]:
import os

imdb_dir = './aclImdb' # Data directory
train_dir = os.path.join(imdb_dir, 'train') # Get the path of the train set

# Setup empty lists to fill
labels = []
texts = []

# First go through the negatives, then through the positives
for label_type in ['neg', 'pos']:
    # Get the sub path
    dir_name = os.path.join(train_dir, label_type)
    
    # Loop over all files in path
    for fname in os.listdir(dir_name):
        
        # Only consider text files
        if fname[-4:] == '.txt':
            # Read the text file and put it in the list
            f = open(os.path.join(dir_name, fname))
            texts.append(f.read())
            f.close()
            # Attach the corresponding label
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

We should have 25,000 texts and labels.


In [3]:
len(labels), len(texts)


Out[3]:
(25000, 25000)

Half of the reviews are positive:


In [19]:
import numpy as np
np.mean(labels)


Out[19]:
0.5

Let's look at a positive review:


In [27]:
print('Label',labels[24002])
print(texts[24002])


Label 1
This film is brilliant! It touches everyone who sees it in an extraordinary way. It really takes you back to your youth and puts a new perspective on how you view your childhood memories. There are so many layers to this film. It is innovative and absolutely fabulous!

And a negative review:


In [28]:
print('Label',labels[1])
print(texts[1])


Label 0
Well...tremors I, the original started off in 1990 and i found the movie quite enjoyable to watch. however, they proceeded to make tremors II and III. Trust me, those movies started going downhill right after they finished the first one, i mean, ass blasters??? Now, only God himself is capable of answering the question "why in Gods name would they create another one of these dumpster dives of a movie?" Tremors IV cannot be considered a bad movie, in fact it cannot be even considered an epitome of a bad movie, for it lives up to more than that. As i attempted to sit though it, i noticed that my eyes started to bleed, and i hoped profusely that the little girl from the ring would crawl through the TV and kill me. did they really think that dressing the people who had stared in the other movies up as though they we're from the wild west would make the movie (with the exact same occurrences) any better? honestly, i would never suggest buying this movie, i mean, there are cheaper ways to find things that burn well.

Tokenizing text

Computers cannot work with words directly. To them, a word is just a meaningless string of characters. To work with words, we need to turn them into so-called tokens. A token is a number that represents a word: each word gets assigned a token. Tokens are usually assigned by word frequency. The most frequent words like 'a' or 'the' get low tokens like 1 or 2, while rarely used words like 'profusely' get assigned very high numbers.

We can tokenize text directly with Keras. When we tokenize text, we usually choose a maximum number of words we want to consider, our vocabulary so to speak. This prevents us from assigning tokens to words that are hardly ever used, mostly because they are typos, not actual words, or simply very uncommon, and it keeps us from overfitting to texts that contain strange words or weird spelling errors. Words beyond that cutoff are simply dropped from the resulting sequences; token 0 is reserved for padding.
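As a minimal sketch of how this cutoff behaves, consider a toy corpus (the toy_ names below are made up for illustration, not part of our IMDB pipeline):

from keras.preprocessing.text import Tokenizer

# Toy corpus, just to illustrate the vocabulary cutoff
toy_texts = ['the movie was great', 'the movie was terrible', 'the plot was thin']

toy_tok = Tokenizer(num_words=3)  # only word indices below 3 survive the cutoff
toy_tok.fit_on_texts(toy_texts)

print(toy_tok.word_index)                                   # every word still gets an index
print(toy_tok.texts_to_sequences(['the movie was great']))  # rarer words are dropped from the sequence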


In [29]:
from keras.preprocessing.text import Tokenizer
import numpy as np

max_words = 10000 # We will only consider the 10K most used words in this dataset

tokenizer = Tokenizer(num_words=max_words) # Setup
tokenizer.fit_on_texts(texts) # Generate tokens by counting frequency
sequences = tokenizer.texts_to_sequences(texts) # Turn text into sequence of numbers

The tokenizer's word index is a dictionary that maps each word to a number. You can see that words that are frequently used in discussions about movies have a lower token number.


In [34]:
word_index = tokenizer.word_index
print('Token for "the"',word_index['the'])
print('Token for "Movie"',word_index['movie'])
print('Token for "generator"',word_index['generator'])


Token for "the" 1
Token for "Movie" 17
Token for "Movie" 20126

Our positive review from earlier has now been converted into a sequence of numbers.


In [37]:
sequences[24002]


Out[37]:
[11,
 19,
 6,
 526,
 9,
 2437,
 313,
 34,
 1081,
 9,
 8,
 32,
 2812,
 93,
 9,
 63,
 301,
 22,
 142,
 5,
 126,
 1934,
 2,
 1454,
 3,
 159,
 1970,
 20,
 86,
 22,
 648,
 126,
 1546,
 1882,
 47,
 23,
 35,
 108,
 5914,
 5,
 11,
 19,
 9,
 6,
 3964,
 2,
 424,
 2734]

To proceed, we now have to make sure that all text sequences we feed into the model have the same length. We can do this with Keras' pad_sequences tool. It cuts off sequences that are too long and adds zeros to sequences that are too short.


In [39]:
from keras.preprocessing.sequence import pad_sequences
maxlen = 100 # Make all sequences 100 words long
data = pad_sequences(sequences, maxlen=maxlen)
print(data.shape) # We have 25K, 100 word sequences now


(25000, 100)

Now we can turn all data into proper training and validation data.


In [44]:
labels = np.asarray(labels)

# Shuffle data
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

training_samples = 20000  # We will be training on 20K samples
validation_samples = 5000  # We will be validating on 5K samples

# Split data
x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]

Embeddings

As the attentive reader might have already guessed, words and word tokens are categorical features. As such, we cannot feed them into the neural network directly. Just because a word has a larger token value does not mean it expresses a higher value in any way; it is just a different category. Previously, we have dealt with categorical data by turning it into one-hot encoded vectors. But for words, this is impractical: since our vocabulary is 10,000 words, each vector would contain 10,000 numbers, all of which are zero except for one. This is highly inefficient, so instead we will use an embedding.

Embeddings also turn categorical data into vectors. But instead of creating a one-hot vector, we create a dense vector in which all elements are real numbers.

In practice, an embedding works like a lookup table. For each token, it stores a vector. When the token is given to the embedding layer, the layer returns the vector for that token and passes it on to the rest of the network. As the network trains, the embeddings get optimized as well. Remember that neural networks work by calculating the derivative of the loss function with respect to the parameters (weights) of the model. Through backpropagation, we can also calculate the derivative of the loss function with respect to the embedding vectors, and so the embeddings can be optimized to deliver inputs that help our model.
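To make the lookup idea concrete, here is a minimal NumPy sketch with toy names and sizes (they are made up for illustration): the embedding is just a matrix with one row per token, and feeding in a token selects the corresponding row, which is equivalent to multiplying a one-hot vector with that matrix.

import numpy as np

toy_vocab_size, toy_dim = 6, 4                      # toy sizes, just for illustration
toy_emb = np.random.rand(toy_vocab_size, toy_dim)   # one row per token; these rows are the trainable weights

toy_tokens = np.array([2, 5, 1])                    # a short sequence of token ids
toy_vectors = toy_emb[toy_tokens]                   # lookup: shape (3, 4), one vector per token

one_hot = np.eye(toy_vocab_size)[toy_tokens]        # the same lookup written as a one-hot matrix multiplication
assert np.allclose(one_hot @ toy_emb, toy_vectors)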

In practice, it looks like this: we have to specify how large we want the word vectors to be. A 50-dimensional vector is able to capture good embeddings even for quite large vocabularies. We also have to specify for how many words we want embeddings and how long our sequences are.


In [45]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

embedding_dim = 50

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 100, 50)           500000    
_________________________________________________________________
flatten_2 (Flatten)          (None, 5000)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 32)                160032    
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 33        
=================================================================
Total params: 660,065
Trainable params: 660,065
Non-trainable params: 0
_________________________________________________________________

You can see that the embedding layer has 500,000 trainable parameters, that is 50 parameters for each of the 10K words.
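As a quick sanity check, we can reproduce the numbers in the summary by hand (this sketch relies only on the max_words, embedding_dim and maxlen values defined above):

embedding_params = max_words * embedding_dim     # 10,000 * 50 = 500,000
flatten_units = maxlen * embedding_dim           # 100 * 50 = 5,000 values per sample after Flatten
dense_1_params = flatten_units * 32 + 32         # weights plus biases = 160,032
dense_2_params = 32 * 1 + 1                      # 33
print(embedding_params + dense_1_params + dense_2_params)  # 660,065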


In [46]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])

In [47]:
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))


Train on 10000 samples, validate on 1000 samples
Epoch 1/10
10000/10000 [==============================] - 3s - loss: 0.5346 - acc: 0.7085 - val_loss: 0.3894 - val_acc: 0.8210
Epoch 2/10
10000/10000 [==============================] - 3s - loss: 0.1322 - acc: 0.9566 - val_loss: 0.4142 - val_acc: 0.8370
Epoch 3/10
10000/10000 [==============================] - 3s - loss: 0.0125 - acc: 0.9990 - val_loss: 0.4537 - val_acc: 0.8380
Epoch 4/10
10000/10000 [==============================] - 3s - loss: 0.0021 - acc: 1.0000 - val_loss: 0.4968 - val_acc: 0.8360
Epoch 5/10
10000/10000 [==============================] - 3s - loss: 0.0010 - acc: 1.0000 - val_loss: 0.5078 - val_acc: 0.8370
Epoch 6/10
10000/10000 [==============================] - 3s - loss: 6.1137e-04 - acc: 1.0000 - val_loss: 0.5245 - val_acc: 0.8360
Epoch 7/10
10000/10000 [==============================] - 3s - loss: 4.0845e-04 - acc: 1.0000 - val_loss: 0.5427 - val_acc: 0.8350
Epoch 8/10
10000/10000 [==============================] - 3s - loss: 2.8757e-04 - acc: 1.0000 - val_loss: 0.5567 - val_acc: 0.8360
Epoch 9/10
10000/10000 [==============================] - 3s - loss: 2.1033e-04 - acc: 1.0000 - val_loss: 0.5693 - val_acc: 0.8350
Epoch 10/10
10000/10000 [==============================] - 3s - loss: 1.5782e-04 - acc: 1.0000 - val_loss: 0.5830 - val_acc: 0.8340

Note that training your own embeddings is prone to overfitting. As you can see, our model achieves 100% accuracy on the training set but only about 83% accuracy on the validation set, a clear sign of overfitting. In practice, it is therefore quite rare to train new embeddings unless you have a massive dataset. Much more commonly, pretrained embeddings are used. A common pretrained embedding is GloVe (Global Vectors for Word Representation). It has been trained on billions of words from Wikipedia and the Gigaword 5 dataset, far more text than we could ever hope to get from our movie reviews. After downloading the GloVe embeddings from the GloVe website, we can load them into our model:


In [48]:
glove_dir = './glove.6B' # This is the folder with the dataset

embeddings_index = {} # We create a dictionary of word -> embedding
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt')) # Open file

# In the dataset, each line represents a new word embedding
# The line starts with the word and the embedding values follow
for line in f:
    values = line.split()
    word = values[0] # The first value is the word, the rest are the values of the embedding
    embedding = np.asarray(values[1:], dtype='float32') # Load embedding
    embeddings_index[word] = embedding # Add embedding to our embedding dictionary
f.close()

print('Found %s word vectors.' % len(embeddings_index))


Found 400000 word vectors.

Not all words in our IMDB vocabulary will be in the GloVe embeddings, though. For missing words, it is wise to use random embeddings with the same mean and standard deviation as the GloVe embeddings.


In [51]:
# Create a matrix of all embeddings
all_embs = np.stack(embeddings_index.values())
emb_mean = all_embs.mean() # Calculate mean
emb_std = all_embs.std() # Calculate standard deviation
emb_mean,emb_std


Out[51]:
(0.0044520083, 0.40815714)

We can now create an embedding matrix holding all word vectors.


In [52]:
embedding_dim = 100 # We now use larger embeddings

word_index = tokenizer.word_index
nb_words = min(max_words, len(word_index)) # How many words are there actually

# Create a random matrix with the same mean and std as the embeddings
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embedding_dim))

# The vectors need to be in the same position as their index. 
# Meaning a word with token 1 needs to be in the second row (rows start with zero) and so on

# Loop over all words in the word index
for word, i in word_index.items():
    # If we are above the amount of words we want to use we do nothing
    if i >= max_words: 
        continue
    # Get the embedding vector for the word
    embedding_vector = embeddings_index.get(word)
    # If there is an embedding vector, put it in the embedding matrix
    if embedding_vector is not None: 
        embedding_matrix[i] = embedding_vector

This embedding matrix can be used as the weights of the embedding layer. That way, the embedding layer uses the pretrained GloVe weights instead of random ones. We can also set the embedding layer to not be trainable. This means Keras won't change the embedding weights during training, which makes sense since our embeddings are already trained.


In [58]:
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen, weights = [embedding_matrix], trainable = False))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_4 (Embedding)      (None, 100, 100)          1000000   
_________________________________________________________________
flatten_4 (Flatten)          (None, 10000)             0         
_________________________________________________________________
dense_7 (Dense)              (None, 32)                320032    
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 33        
=================================================================
Total params: 1,320,065
Trainable params: 320,065
Non-trainable params: 1,000,000
_________________________________________________________________

Notice that we now have far fewer trainable parameters.


In [59]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])

In [61]:
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))


Train on 10000 samples, validate on 1000 samples
Epoch 1/10
10000/10000 [==============================] - 2s - loss: 0.1271 - acc: 0.9584 - val_loss: 1.0561 - val_acc: 0.6800
Epoch 2/10
10000/10000 [==============================] - 2s - loss: 0.0986 - acc: 0.9730 - val_loss: 1.1435 - val_acc: 0.6840
Epoch 3/10
10000/10000 [==============================] - 2s - loss: 0.0764 - acc: 0.9836 - val_loss: 1.2219 - val_acc: 0.6760
Epoch 4/10
10000/10000 [==============================] - 2s - loss: 0.0647 - acc: 0.9874 - val_loss: 1.3341 - val_acc: 0.6920
Epoch 5/10
10000/10000 [==============================] - 2s - loss: 0.0485 - acc: 0.9932 - val_loss: 1.4150 - val_acc: 0.6860
Epoch 6/10
10000/10000 [==============================] - 2s - loss: 0.0378 - acc: 0.9969 - val_loss: 1.5264 - val_acc: 0.6740
Epoch 7/10
10000/10000 [==============================] - 2s - loss: 0.0329 - acc: 0.9970 - val_loss: 1.6081 - val_acc: 0.6720
Epoch 8/10
10000/10000 [==============================] - 2s - loss: 0.0255 - acc: 0.9984 - val_loss: 1.7429 - val_acc: 0.6710
Epoch 9/10
10000/10000 [==============================] - 2s - loss: 0.0201 - acc: 0.9993 - val_loss: 1.7580 - val_acc: 0.6750
Epoch 10/10
10000/10000 [==============================] - 2s - loss: 0.0148 - acc: 0.9995 - val_loss: 1.8321 - val_acc: 0.6720

Now our model overfits less but also does worse on the validation set.

Using our model

To determine the sentiment of a text, we can now use our trained model.


In [73]:
# Demo on a positive text
my_text = 'I love dogs. Dogs are the best. They are lovely, cuddly animals that only want the best for humans.'

seq = tokenizer.texts_to_sequences([my_text])
print('raw seq:',seq)
seq = pad_sequences(seq, maxlen=maxlen)
print('padded seq:',seq)
prediction = model.predict(seq)
print('positivity:',prediction)


raw seq: [[10, 116, 2515, 2515, 23, 1, 115, 33, 23, 1331, 1385, 12, 61, 178, 1, 115, 15, 1706]]
padded seq: [[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0   10  116
  2515 2515   23    1  115   33   23 1331 1385   12   61  178    1  115
    15 1706]]
positivity: [[ 0.9042114]]

In [81]:
# Demo on a negative text
my_text = 'The bleak economic outlook will force many small businesses into bankruptcy.'

seq = tokenizer.texts_to_sequences([my_text])
print('raw seq:',seq)
seq = pad_sequences(seq, maxlen=maxlen)
print('padded seq:',seq)
prediction = model.predict(seq)
print('positivity:',prediction)


raw seq: [[1, 3763, 7013, 77, 1144, 108, 389, 80]]
padded seq: [[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    1 3763 7013   77 1144  108
   389   80]]
positivity: [[ 0.23451628]]

Word embeddings as semantic geometry

One very interesting aspect of embeddings trained on large amounts of text is that they show patterns in which the geometric relationships between word vectors correspond to the semantic relationships between those words.

You can see, for instance, that the direction from feminine words to their masculine counterparts is roughly the same across many word pairs. In other words, if you were to subtract the word vector for 'woman' from the vector for 'queen' and add the word vector for 'man', you would arrive close to 'king'. This also works for other relationships, such as comparatives and superlatives.

This highlights some interesting properties of language in which semantic meanings can be seen as directions which can be added or subtracted.
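We can try this arithmetic ourselves with the GloVe vectors we loaded into embeddings_index earlier. The helper below (nearest_words is an illustrative name, not a library function) ranks the whole vocabulary by cosine similarity; looping over 400K words this way is slow but fine for a quick demo, and the exact neighbours can vary with the GloVe variant you downloaded.

def nearest_words(vector, embeddings, topn=5):
    # Rank all words by cosine similarity to the query vector
    scores = {}
    v_norm = np.linalg.norm(vector)
    for word, emb in embeddings.items():
        scores[word] = np.dot(vector, emb) / (v_norm * np.linalg.norm(emb) + 1e-8)
    return sorted(scores, key=scores.get, reverse=True)[:topn]

query = embeddings_index['king'] - embeddings_index['man'] + embeddings_index['woman']
print(nearest_words(query, embeddings_index))  # 'queen' should show up near the top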

A sad side effect of training word vectors on human writing is that they capture human biases. For example, it has been shown that for word vectors trained on news text, 'Programmer' - 'Man' + 'Woman' lands close to 'Homemaker', reflecting the bias in language that assigns the role of homemaker more often to women than to men. Measuring these biases in embeddings and correcting them has become a field of research in its own right, which highlights how even professional writing from news outlets can be biased.

Summary

In this chapter, you have taken your first steps into natural language processing. You have learned about tokenization and word embeddings, how to train your own embeddings, and how to load pretrained embeddings into your model.